GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
In this paper we present GumDrop, Georgetown University's entry at the DISRPT
2019 Shared Task on automatic discourse unit segmentation and connective
detection. Our approach relies on model stacking, creating a heterogeneous
ensemble of classifiers, which feed into a metalearner for each final task. The
system encompasses three trainable component stacks: one for sentence
splitting, one for discourse unit segmentation and one for connective
detection. The flexibility of each ensemble allows the system to generalize
well to datasets of different sizes and with varying levels of homogeneity.
Comment: Proceedings of Discourse Relation Parsing and Treebanking (DISRPT 2019)
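The model-stacking architecture described above, with base classifiers feeding a metalearner, can be sketched in miniature. The toy rule-based classifiers and majority-vote metalearner below are illustrative placeholders, not GumDrop's actual ensemble members or trained metalearner:

```python
# Minimal sketch of model stacking: heterogeneous base classifiers each
# emit a boundary/connective label, and their outputs become the input
# features of a metalearner. All three "classifiers" here are toy
# heuristics standing in for GumDrop's real trained components.

def base_clf_length(token):
    # Toy heuristic: long tokens are flagged as candidates.
    return 1 if len(token) > 6 else 0

def base_clf_punct(token):
    # Toy heuristic: sentence-final punctuation marks a boundary.
    return 1 if token in {".", "!", "?"} else 0

def base_clf_connective(token):
    # Tiny lexicon of discourse connectives (hypothetical, for the demo).
    return 1 if token.lower() in {"because", "however", "although"} else 0

BASE_CLASSIFIERS = [base_clf_length, base_clf_punct, base_clf_connective]

def meta_learner(votes):
    # Majority vote over base-classifier outputs; a trained metalearner
    # (e.g. a gradient-boosted model over these features) would replace this.
    return 1 if sum(votes) >= 2 else 0

def predict(tokens):
    return [meta_learner([clf(tok) for clf in BASE_CLASSIFIERS])
            for tok in tokens]

print(predict(["because", "the", "model", "."]))  # → [1, 0, 0, 0]
```

The point of the stacked design is that the metalearner sees every base classifier's opinion per token, so weak but complementary signals can be combined.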
Adpositional Supersenses for Mandarin Chinese
This study adapts the Semantic Network of Adposition and Case Supersenses (SNACS) annotation scheme to Mandarin Chinese and demonstrates that the same supersense categories are appropriate for Chinese adposition semantics. We annotated 20 chapters of The Little Prince, with high interannotator agreement. The parallel corpus substantiates the applicability of construal analysis in Chinese and gives insight into the differences in construals between adpositions in the two languages. The corpus can further support automatic disambiguation of adpositions in Chinese, and the common inventory of supersenses between the two languages can potentially serve cross-linguistic tasks such as machine translation.
Mixture Proportion Estimation Beyond Irreducibility
The task of mixture proportion estimation (MPE) is to estimate the weight of
a component distribution in a mixture, given observations from both the
component and mixture. Previous work on MPE adopts the irreducibility
assumption, which ensures identifiability of the mixture proportion. In this
paper, we propose a more general sufficient condition that accommodates several
settings of interest where irreducibility does not hold. We further present a
resampling-based meta-algorithm that takes any existing MPE algorithm designed
to work under irreducibility and adapts it to work under our more general
condition. Our approach empirically exhibits improved estimation performance
relative to baseline methods and to a recently proposed regrouping-based
algorithm.
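Under irreducibility, a classical MPE estimator is the infimum of the ratio of the mixture density to the component density. The following histogram plug-in version of that idea is a self-contained toy demonstration (with invented distributions), not the paper's proposed meta-algorithm:

```python
import random

random.seed(0)

KAPPA = 0.3  # true mixture proportion (ground truth for this demo only)

# Component H ~ Uniform(0, 1); background G ~ Uniform(1, 3).
# Mixture F = KAPPA * H + (1 - KAPPA) * G, so irreducibility holds here.
component = [random.uniform(0, 1) for _ in range(50000)]
mixture = [random.uniform(0, 1) if random.random() < KAPPA
           else random.uniform(1, 3) for _ in range(50000)]

def hist(xs, lo, hi, nbins):
    # Normalized histogram: fraction of samples per bin.
    counts = [0] * nbins
    for x in xs:
        i = min(int((x - lo) / (hi - lo) * nbins), nbins - 1)
        counts[i] += 1
    return [c / len(xs) for c in counts]

def estimate_kappa(mix, comp, lo=0.0, hi=3.0, nbins=30):
    f = hist(mix, lo, hi, nbins)
    h = hist(comp, lo, hi, nbins)
    # Plug-in infimum of the ratio f/h over bins where the component
    # has mass; under irreducibility this approaches the true proportion.
    return min(fb / hb for fb, hb in zip(f, h) if hb > 0)

print(estimate_kappa(mixture, component))  # close to 0.3
```

When irreducibility fails (the background G overlaps the component H everywhere), this infimum-ratio estimate is biased downward, which is the regime the paper's more general condition and resampling meta-algorithm are designed to handle.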
Overview of AMALGUM – Large Silver Quality Annotations across English Genres
Corpus resources for linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood, annotated datasets: corpora are either very large (100M-10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) with high-quality curated data for more interpretable inferences with complex annotations.
Reversible Interlayer Sliding and Conductivity Changes in Adaptive Tetrathiafulvalene-Based Covalent Organic Frameworks.
Ordered interlayer stacking is intrinsic to two-dimensional covalent organic frameworks (2D COFs) and has strong implications for their optoelectronic properties. Reversible interlayer sliding, corresponding to shearing of 2D layers along their basal plane, is an appealing means of dynamic control over both structures and properties, yet it remains unexplored in the 2D COF field. Herein, we demonstrate that reversible interlayer sliding can be realized in an imine-linked tetrathiafulvalene (TTF)-based COF, TTF-DMTA. Solvent treatment induces crystalline phase changes between the proposed staircase-like sql net structure and a slightly slipped eclipsed sql net structure. The solvation-induced crystallinity changes correlate well with reversible spectroscopic and electrical conductivity changes, as demonstrated in oriented COF thin films. In contrast, no reversible switching is observed in a related TTF-TA COF, which differs from TTF-DMTA only in the absence of methoxy groups on the phenylene linkers. This work represents the first example of a 2D COF in which eclipsed and staircase-like aggregated states are interchangeably accessed via interlayer sliding, an uncharted structural feature that may enable applications such as chemiresistive sensors.
GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
We present GENTLE, a new mixed-genre English challenge corpus totaling 17K
tokens and consisting of 8 unusual text types for out-of-domain evaluation:
dictionary entries, esports commentaries, legal documents, medical notes,
poetry, mathematical proofs, syllabuses, and threat letters. GENTLE is manually
annotated for a variety of popular NLP tasks, including syntactic dependency
parsing, entity recognition, coreference resolution, and discourse parsing. We
evaluate state-of-the-art NLP systems on GENTLE and find severe degradation for
at least some genres in their performance on all tasks, which indicates
GENTLE's utility as an evaluation dataset for NLP systems.
Comment: Camera-ready for LAW-XVII collocated with ACL 202
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual
coreference resolution associated with the CRAC 2022 workshop. Shared task
participants were supposed to develop trainable systems capable of identifying
mentions and clustering them according to identity coreference. The public
edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used
as the source of training and evaluation data. The CoNLL score used in previous
coreference-oriented shared tasks was used as the main evaluation metric. There
were 8 coreference prediction systems submitted by 5 participating teams; in
addition, there was a competitive Transformer-based baseline system provided by
the organizers at the beginning of the shared task. The winning system
outperformed the baseline by 12 percentage points (in terms of the CoNLL scores
averaged across all datasets for individual languages).
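The CoNLL score used as the main metric is conventionally the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores, macro-averaged over datasets for the final ranking. A minimal sketch of that arithmetic follows; the dataset names and F1 values are invented for illustration and are not results from the shared task:

```python
# CoNLL score: unweighted mean of MUC, B-cubed, and CEAF-e F1
# (the usual convention from the CoNLL coreference shared tasks).
def conll_score(muc_f1, b3_f1, ceafe_f1):
    return (muc_f1 + b3_f1 + ceafe_f1) / 3

# Macro-average across datasets for the overall ranking.
# These dataset names and numbers are made up for the demo.
per_dataset = {
    "lang1_corpus": conll_score(70.0, 65.0, 60.0),  # 65.0
    "lang2_corpus": conll_score(55.0, 50.0, 45.0),  # 50.0
}
average = sum(per_dataset.values()) / len(per_dataset)
print(average)  # → 57.5
```

Macro-averaging over datasets means each corpus counts equally in the ranking regardless of its size, which matters when, as here, the 13 CorefUD datasets vary widely in scale.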